Note: Blue boxes explain concepts, add context to answers, and tell the story. Is Sia better than Lady Gaga? Yes. And I'll use this data to make my case.
Answer: Green boxes are direct answers to questions, mostly found at the end of the notebook.
Unstyled markdown denotes sections in the Exploratory-Data-Analysis-Assignment-2024.docx file. This notebook does not follow the same order as the Word doc, so search for section labels like B.b.iv. or for the sentence that describes the requirement.
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
B.a. You must read in the data associated with your project.¶
In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
df
Out[3]:
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre ... key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
0 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxur... Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx I Don't Care (with Justin Bieber) [Loud Luxury... 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 6 -2.634 1 0.0583 0.102000 0.000000 0.0653 0.5180 122.036 194754
1 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 11 -4.969 1 0.0373 0.072400 0.004210 0.3570 0.6930 99.972 162600
2 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 1 -3.432 0 0.0742 0.079400 0.000023 0.1100 0.6130 124.008 176616
3 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 7 -3.778 1 0.1020 0.028700 0.000009 0.2040 0.2770 121.956 169093
4 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... 1 -4.672 1 0.0359 0.080300 0.000000 0.0833 0.7250 123.976 189052
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32828 7bxnKAamR3snQ1VGLuVfC1 City Of Lights - Official Radio Edit Lush & Simon 42 2azRoBBWEEEYhqV6sb7JrT City Of Lights (Vocal Mix) 2014-04-28 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux edm ... 2 -1.814 1 0.0936 0.076600 0.000000 0.0668 0.2100 128.170 204375
32829 5Aevni09Em4575077nkWHz Closer - Sultan & Ned Shepard Remix Tegan and Sara 20 6kD6KLxj7s8eCE3ABvAyf5 Closer Remixed 2013-03-08 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux edm ... 0 -4.462 1 0.0420 0.001710 0.004270 0.3750 0.4000 128.041 353120
32830 7ImMqPP3Q1yfUHvsdn7wEo Sweet Surrender - Radio Edit Starkillers 14 0ltWNSY9JgxoIZO4VzuCa6 Sweet Surrender (Radio Edit) 2014-04-21 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux edm ... 6 -4.899 0 0.0481 0.108000 0.000001 0.1500 0.4360 127.989 210112
32831 2m69mhnfQ1Oq6lGtXuYhgX Only For You - Maor Levi Remix Mat Zo 15 1fGrOkHnHJcStl14zNx8Jy Only For You (Remixes) 2014-01-01 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux edm ... 2 -3.361 1 0.1090 0.007920 0.127000 0.3430 0.3080 128.008 367432
32832 29zWqhca3zt5NsckZqDf6c Typhoon - Original Mix Julian Calor 27 0X3mUOm6MhxR7PzxG95rAo Typhoon/Storm 2014-03-03 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux edm ... 5 -4.571 0 0.0385 0.000133 0.341000 0.7420 0.0894 127.984 337500

32833 rows × 23 columns

B.b.ii. Display the column names and their associated data types.¶
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudness                  32833 non-null  float64
 15  mode                      32833 non-null  int64  
 16  speechiness               32833 non-null  float64
 17  acousticness              32833 non-null  float64
 18  instrumentalness          32833 non-null  float64
 19  liveness                  32833 non-null  float64
 20  valence                   32833 non-null  float64
 21  tempo                     32833 non-null  float64
 22  duration_ms               32833 non-null  int64  
dtypes: float64(9), int64(4), object(10)
memory usage: 5.8+ MB
In [5]:
df.describe()
Out[5]:
track_popularity danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
count 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000 32833.000000
mean 42.477081 0.654850 0.698619 5.374471 -6.719499 0.565711 0.107068 0.175334 0.084747 0.190176 0.510561 120.881132 225799.811622
std 24.984074 0.145085 0.180910 3.611657 2.988436 0.495671 0.101314 0.219633 0.224230 0.154317 0.233146 26.903624 59834.006182
min 0.000000 0.000000 0.000175 0.000000 -46.448000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4000.000000
25% 24.000000 0.563000 0.581000 2.000000 -8.171000 0.000000 0.041000 0.015100 0.000000 0.092700 0.331000 99.960000 187819.000000
50% 45.000000 0.672000 0.721000 6.000000 -6.166000 1.000000 0.062500 0.080400 0.000016 0.127000 0.512000 121.984000 216000.000000
75% 62.000000 0.761000 0.840000 9.000000 -4.645000 1.000000 0.132000 0.255000 0.004830 0.248000 0.693000 133.918000 253585.000000
max 100.000000 0.983000 1.000000 11.000000 1.275000 1.000000 0.918000 0.994000 0.994000 0.996000 0.991000 239.440000 517810.000000
Note: We can derive the number of rows and columns from the output above; however, let's be explicit with the cell below.
B.b.i. Display the number of rows and columns.¶
In [6]:
print(f'There are {df.shape[0]} rows in the dataframe.')
print(f'There are {df.shape[1]} columns in the dataframe.')
There are 32833 rows in the dataframe.
There are 23 columns in the dataframe.
B.b.iii. Display the number of missing values for each column.¶
Note: We can also derive the answer to the above question from the info method, but let's be a little more clear.
In [7]:
df.isna().sum()
Out[7]:
track_id                    0
track_name                  5
track_artist                5
track_popularity            0
track_album_id              0
track_album_name            5
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64
B.b.iv. Display the number of unique values for each column.¶
In [8]:
df.nunique().sort_values()
Out[8]:
mode                            2
playlist_genre                  6
key                            12
playlist_subgenre              24
track_popularity              101
playlist_name                 449
playlist_id                   471
danceability                  822
energy                        952
speechiness                  1270
valence                      1362
liveness                     1624
acousticness                 3731
track_album_release_date     4530
instrumentalness             4729
loudness                    10222
track_artist                10692
tempo                       17684
track_album_name            19743
duration_ms                 19785
track_album_id              22545
track_name                  23449
track_id                    28356
dtype: int64
In [9]:
primary_categorical_variables = df.nunique().sort_values()[0:5].index.to_list()
primary_categorical_variables
Out[9]:
['mode', 'playlist_genre', 'key', 'playlist_subgenre', 'track_popularity']
In [10]:
primary_continuous_variables = df.nunique().sort_values()[5:-1].index.to_list()
primary_continuous_variables = [col for col in primary_continuous_variables if df[col].dtype != 'object']
primary_continuous_variables
Out[10]:
['danceability',
 'energy',
 'speechiness',
 'valence',
 'liveness',
 'acousticness',
 'instrumentalness',
 'loudness',
 'tempo',
 'duration_ms']
In [11]:
all_primaries = primary_categorical_variables + primary_continuous_variables
Note: I'm really only concerned with the columns listed above in primary_categorical_variables and primary_continuous_variables. Columns like track_name are left out because they are object columns with too many unique values to explore as categories.
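Note: The slices in the cells above depend on where each column lands in the sorted unique-count list. A minimal sketch of the same idea driven by an explicit cutoff is shown below. The toy dataframe, its values, and the MAX_CATEGORIES cutoff are assumptions made up for illustration; they are not part of the Spotify data.

```python
import pandas as pd

# Toy stand-in for the Spotify dataframe (values are invented for illustration).
toy = pd.DataFrame({
    'mode': [0, 1, 1, 0, 1, 0],
    'key': [0, 5, 7, 5, 11, 2],
    'danceability': [0.748, 0.726, 0.675, 0.718, 0.650, 0.428],
    'track_name': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Hypothetical cutoff chosen for this tiny example; the right value is subjective.
MAX_CATEGORIES = 5

# Columns with few unique values are candidates for categorical treatment.
unique_counts = toy.nunique()
categorical_candidates = unique_counts[unique_counts <= MAX_CATEGORIES].index.to_list()

# Remaining numeric columns are candidates for continuous treatment.
numeric_columns = toy.select_dtypes(include='number').columns
continuous_candidates = [col for col in numeric_columns if col not in categorical_candidates]

print(categorical_candidates)
print(continuous_candidates)
```

Using select_dtypes keeps object columns such as track_name out of the continuous list without checking each dtype by hand.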
B.c. You must state whether you want to effectively treat a numeric column as a non-numeric for exploration purposes.¶
B.d. You do NOT need to display the COUNTS for categorical variables because you will visually display the variables. However, you may display/print COUNTS if it helps you.¶
Note: The output above gives us more than unique counts. I have learned that integer and float columns can be treated as categorical when the number of unique values is small. The cutoff is subjective; here I draw the line after the "playlist_subgenre" column (24 unique values). That leaves, at most, 4 categorical variables: "mode", "playlist_genre", "key", and "playlist_subgenre". The rest of the variables, including "track_popularity" (101 unique values), will be treated as continuous.

Although, I am also interested in some variables that I assumed would be categorical but are not, such as "danceability", "track_popularity", and "energy".

In [12]:
df[['danceability', 'track_popularity', 'energy']]
Out[12]:
danceability track_popularity energy
0 0.748 66 0.916
1 0.726 67 0.815
2 0.675 70 0.931
3 0.718 60 0.930
4 0.650 69 0.833
... ... ... ...
32828 0.428 42 0.922
32829 0.522 20 0.786
32830 0.529 14 0.821
32831 0.626 15 0.888
32832 0.603 27 0.884

32833 rows × 3 columns

In [13]:
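def categorize_danceability(danceability):
    # Bucket a 0-1 danceability score into four coarse, ordered labels.
    # The comparisons are strict, so a score exactly equal to 0.25, 0.50,
    # or 0.75 matches no branch and the function returns None for it.
    if danceability > .75:
        return "high"
    if .50 < danceability < .75:
        return "medium"
    if .25 < danceability < 0.50:
        return "low"
    if danceability < 0.25:
        return "elevator"

# Passing the function directly is equivalent to wrapping it in a lambda.
df['_danceability_categories'] = df['danceability'].apply(categorize_danceability)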
In [14]:
df['_danceability_categories'].value_counts()
Out[14]:
_danceability_categories
medium      18791
high         9040
low          4672
elevator      215
Name: count, dtype: int64
Note: This is feature engineering: a broad, ordinal categorization of danceability. The same could be done for other similar variables. I prefix engineered features with an underscore.
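Note: The four counts above sum to 32,718, which is 115 fewer than the 32,833 rows in the dataframe, so tracks sitting exactly on a boundary (0.25, 0.50, or 0.75) end up without a category. A possible alternative is pd.cut, sketched below on a few made-up scores. Assigning each boundary value to the higher band is my assumption here, not something the original function specifies.

```python
import pandas as pd

# Made-up scores, including the three boundary values.
scores = pd.Series([0.10, 0.25, 0.40, 0.50, 0.60, 0.75, 0.90], name='danceability')

labels = ['elevator', 'low', 'medium', 'high']

# right=False closes each bin on the left: [0, .25), [.25, .50), [.50, .75), [.75, inf).
# Assumption: a boundary value belongs to the higher band.
# The open-ended top edge guarantees that a score of exactly 1.0 is still labeled.
categories = pd.cut(
    scores,
    bins=[0.0, 0.25, 0.50, 0.75, float('inf')],
    labels=labels,
    right=False,
)

print(categories.value_counts())
```

Because pd.cut returns an ordered categorical, the labels also keep their elevator < low < medium < high order in plots and sorts.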
B.e. You must visualize the MARGINAL distributions for ALL variables in your data.¶
B.e.ii. Categorical variables: Bar charts¶
Note: Defining a function to facilitate code reuse. For continuous variables I will copy and paste code since I will not be creating a custom subplot.
In [15]:
def visualize_categories(x, title):
    sub_df = df[x].value_counts()
    fig, ax = plt.subplots(1, 2, figsize=(15, 5))
    colors_cycle = plt.get_cmap('tab20', len(sub_df))  # plt.cm.get_cmap was removed in newer matplotlib releases
    
    if len(sub_df) > 20:
        ax[0].barh(sub_df.index, sub_df, color=[colors_cycle(i) for i in range(len(sub_df))])
        ax[0].set_xlabel('Count')
        ax[0].set_ylabel(title)
    else:
        ax[0].bar(sub_df.index, sub_df, color=[colors_cycle(i) for i in range(len(sub_df))])
        ax[0].set_xlabel(title)
        ax[0].set_ylabel('Count')

    ax[0].grid(True)
    
    ax[1].pie(sub_df.values, labels=sub_df.index, autopct='%1.1f%%', startangle=140,
              colors=[colors_cycle(i) for i in range(len(sub_df))])

    plt.show()
In [16]:
visualize_categories('mode', 'Spotify Modes')
No description has been provided for this image
In [17]:
visualize_categories('playlist_genre', 'Playlist Genre')
No description has been provided for this image
In [18]:
visualize_categories('key', 'Key')
No description has been provided for this image
In [19]:
visualize_categories('playlist_subgenre', 'Playlist Subgenre')
No description has been provided for this image
In [20]:
visualize_categories('_danceability_categories', 'Danceability Categories')
No description has been provided for this image
Note: I generally start with categorical data and then move to continuous data.
B.e.i. Continuous variables: Histograms or density plots.¶
In [21]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='track_popularity', kde=True)
plot.set_xlabel('Track Popularity')
plot.grid(True)
No description has been provided for this image
In [22]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='danceability', kde=True)
plot.set_xlabel('Danceability')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
In [23]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='energy', kde=True)
plot.set_xlabel('Energy')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
In [24]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='loudness', kde=True)
plot.set_xlabel('Loudness')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
In [25]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='speechiness', kde=True)
plot.set_xlabel('Speechiness')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
In [26]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='acousticness', kde=True)
plot.set_xlabel('Acousticness')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
In [27]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='instrumentalness', kde=True)
plot.set_xlabel('Instrumentalness')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
In [28]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='liveness', kde=True)
plot.set_xlabel('Liveness')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
In [29]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='valence', kde=True)
plot.set_xlabel('Valence')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
Note: Valence is a measurement of emotion in which 0 is closer to music that might be described as sad, depressed, angry, or melancholic. 1 is closer to music that might be described as happy, joyful, euphoric, or cheerful.

Well, that's funny. Let's take a detour and see whether a song I know well illustrates that.

In [30]:
'Sia' in df['track_artist'].values
Out[30]:
True
In [31]:
df[df['track_artist'] == 'Sia'].head()
Out[31]:
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre ... loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms _danceability_categories
45 3C4fzUgIP3ahkhOBdP38GM Heaven Can Wait - The Aston Shuffle Remix Sia 61 6fUjIqkNJY56xS62ild7D1 Heaven Can Wait (The Aston Shuffle Remix) 2019-05-24 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop ... -7.626 1 0.0636 0.000609 0.746000 0.0909 0.442 121.968 218373 medium
579 1TVEPho7uQub2pMJyzO8JD No New Friends - Dombresky Remix Sia 55 32xFYr17qgtcvVsDqVZHf7 No New Friends (Remixes) 2019-05-16 Dance Pop Tunes 4SdfG4cPG3skmTiQLozZGh pop ... -4.535 1 0.0657 0.026400 0.812000 0.0953 0.553 125.009 246720 high
581 2VDTgXwf3F6kcpX4HI0BKy Heaven Can Wait (feat. Sia, Diplo, and Labrinth) Sia 62 0ujHQ5WCLuKJQXOqXpGtpf LABRINTH, SIA & DIPLO PRESENT... LSD 2019-04-12 Dance Pop Tunes 4SdfG4cPG3skmTiQLozZGh pop ... -5.915 0 0.0817 0.034000 0.000026 0.1390 0.533 167.811 195747 medium
1206 27SdWb2rFzO6GWiYDBTD9j Cheap Thrills Sia 77 77jAfTh3KH9K2reMOmTgOh This Is Acting 2016-01-29 Pop Inglés (2020 - 2010s)💙 Música En Inglés 2010s 4QAzO1Z92yqqD6Mf3mOANu pop ... -5.608 0 0.1050 0.047200 0.001430 0.0907 0.732 89.976 211667 medium
1221 4VrWlk8IQxevMvERoX08iC Chandelier Sia 79 3xFSl9lIRaYXIYkIn3OIl9 1000 Forms Of Fear 2014-07-04 Pop Inglés (2020 - 2010s)💙 Música En Inglés 2010s 4QAzO1Z92yqqD6Mf3mOANu pop ... -2.880 1 0.0499 0.019700 0.000061 0.0685 0.572 117.089 216120 low

5 rows × 24 columns

In [32]:
df[df['track_name'] == 'Cheap Thrills']['valence']
Out[32]:
1206     0.732
2211     0.732
2566     0.732
2686     0.728
3775     0.732
20507    0.728
23903    0.732
30457    0.728
30637    0.732
Name: valence, dtype: float64
In [33]:
df[df['track_name'] == 'Cheap Thrills']
Out[33]:
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre ... loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms _danceability_categories
1206 27SdWb2rFzO6GWiYDBTD9j Cheap Thrills Sia 77 77jAfTh3KH9K2reMOmTgOh This Is Acting 2016-01-29 Pop Inglés (2020 - 2010s)💙 Música En Inglés 2010s 4QAzO1Z92yqqD6Mf3mOANu pop ... -5.608 0 0.105 0.0472 0.001430 0.0907 0.732 89.976 211667 medium
2211 27SdWb2rFzO6GWiYDBTD9j Cheap Thrills Sia 77 77jAfTh3KH9K2reMOmTgOh This Is Acting 2016-01-29 Post Teen Pop 2Aud8PXoTGcyY5esHu4Aaz pop ... -5.608 0 0.105 0.0472 0.001430 0.0907 0.732 89.976 211667 medium
2566 27SdWb2rFzO6GWiYDBTD9j Cheap Thrills Sia 77 77jAfTh3KH9K2reMOmTgOh This Is Acting 2016-01-29 Electropop Hits 2017-2020 7kyvBmlc1uSqsTL0EuNLrx pop ... -5.608 0 0.105 0.0472 0.001430 0.0907 0.732 89.976 211667 medium
2686 378iszndTZAR4dH8kwsLC6 Cheap Thrills Sia 63 3Xz1Bcn4YP9bdqlsstMfuJ This Is Acting 2016-02-18 ELECTROPOP 2UsEj2GUukV0GLbsE3rldz pop ... -4.931 0 0.215 0.0561 0.000002 0.0775 0.728 89.972 224813 medium
3775 7BPw9h8vCIX94N0rjFU12S Cheap Thrills Sia 61 3Xz1Bcn4YP9bdqlsstMfuJ This Is Acting 2016-02-18 10er Playlist 1kEczIkZH8IgaWT2BiApxZ pop ... -5.608 0 0.105 0.0472 0.001430 0.0907 0.732 89.976 211667 medium
20507 6b3b7lILUJqXcp6w9wNQSm Cheap Thrills Sia 68 4BTlXiDFjyJfpHjR7jlEJo Cheap Thrills 2016-02-11 Latin/Hip Hop/Dancehall/Soca 2rg9LCyvaMQvnCE2hVndpR latin ... -4.931 0 0.215 0.0561 0.000002 0.0775 0.728 89.972 224813 medium
23903 27SdWb2rFzO6GWiYDBTD9j Cheap Thrills Sia 77 77jAfTh3KH9K2reMOmTgOh This Is Acting 2016-01-29 Today's Hits (Clean) 7ENISpOJhocpMJVcGb0qcT r&b ... -5.608 0 0.105 0.0472 0.001430 0.0907 0.732 89.976 211667 medium
30457 6b3b7lILUJqXcp6w9wNQSm Cheap Thrills Sia 68 4BTlXiDFjyJfpHjR7jlEJo Cheap Thrills 2016-02-11 2010 - 2011 - 2012 - 2013 - 2014 - 2015 - 2016... 2DjIfVDXGYDgRxw7IJTKVb edm ... -4.931 0 0.215 0.0561 0.000002 0.0775 0.728 89.972 224813 medium
30637 27SdWb2rFzO6GWiYDBTD9j Cheap Thrills Sia 77 77jAfTh3KH9K2reMOmTgOh This Is Acting 2016-01-29 2015 songs 6UJw1egIcZVfrBmcKs5uHH edm ... -5.608 0 0.105 0.0472 0.001430 0.0907 0.732 89.976 211667 medium

9 rows × 24 columns

Note: Even though we counted the unique values for each column, it wasn't until now that I realized tracks repeat in the data: the same track appears once for every playlist that contains it, and different releases of the same song carry different track_id values. As for understanding valence, I'm a little skeptical. I would say Cheap Thrills by Sia is one of the happiest songs I know!
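Note: Since the dataframe has 32,833 rows but only 28,356 unique track_id values, one way to keep repeated tracks from being counted more than once is to keep a single row per track_id. The sketch below uses a small made-up dataframe; whether de-duplication is appropriate depends on the question, because it discards the playlist-level information.

```python
import pandas as pd

# Made-up rows: the same track_id appears once per playlist that contains it.
toy = pd.DataFrame({
    'track_id': ['t1', 't1', 't2', 't3', 't3', 't3'],
    'track_name': ['Song A', 'Song A', 'Song A', 'Song B', 'Song B', 'Song B'],
    'playlist_name': ['p1', 'p2', 'p3', 'p1', 'p2', 'p4'],
    'valence': [0.732, 0.732, 0.728, 0.40, 0.40, 0.40],
})

# How many playlists does each track appear in?
appearances = toy['track_id'].value_counts()

# Keep the first row seen for each track_id.
unique_tracks = toy.drop_duplicates(subset='track_id', keep='first')

print(appearances)
print(unique_tracks)
```

Note that 'Song A' still appears twice afterwards, because two releases of the same song have different track_id values.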
In [34]:
df[df['track_artist'] == 'Lady Gaga']['track_name'].unique()
Out[34]:
array(['Poker Face', 'Applause', 'Judas', 'The Cure', 'Bad Romance',
       'Just Dance', 'Paparazzi', 'Alejandro', 'Shallow - Radio Edit',
       'Swine', 'Shallow'], dtype=object)
In [35]:
df[df['track_artist'] == 'Sia']['track_name'].unique()
Out[35]:
array(['Heaven Can Wait - The Aston Shuffle Remix',
       'No New Friends - Dombresky Remix',
       'Heaven Can Wait (feat. Sia, Diplo, and Labrinth)',
       'Cheap Thrills', 'Chandelier', 'The Greatest', 'Elastic Heart',
       'Never Give Up', 'Mountains', 'Audio',
       'No New Friends (feat. Sia, Diplo, and Labrinth)',
       'Cheap Thrills Remix', 'Breathe Me', 'Original (from Dolittle)',
       'Move Your Body - Alan Walker Remix'], dtype=object)
In [36]:
df[df['track_artist'] == 'Sia']['playlist_genre'].value_counts()
Out[36]:
playlist_genre
pop      19
edm       8
latin     6
r&b       5
Name: count, dtype: int64
In [37]:
df[df['track_artist'] == 'Lady Gaga']['playlist_genre'].value_counts()
Out[37]:
playlist_genre
pop      14
edm       5
latin     3
r&b       2
rap       1
Name: count, dtype: int64
In [38]:
df[df['track_artist'] == 'Sia']['playlist_subgenre'].value_counts()
Out[38]:
playlist_subgenre
electropop                   9
dance pop                    6
pop edm                      6
latin hip hop                5
hip pop                      4
post-teen pop                3
indie poptimism              1
latin pop                    1
urban contemporary           1
electro house                1
progressive electro house    1
Name: count, dtype: int64
In [39]:
df[df['track_artist'] == 'Lady Gaga']['playlist_subgenre'].value_counts()
Out[39]:
playlist_subgenre
post-teen pop         10
pop edm                5
latin hip hop          3
electropop             2
dance pop              1
indie poptimism        1
southern hip hop       1
urban contemporary     1
hip pop                1
Name: count, dtype: int64
B.f.ii. Categorical-to-continuous relationships.¶
In [40]:
sia_vs_gaga_df = df[(df['track_artist'] == 'Sia') | (df['track_artist'] == 'Lady Gaga')]

plt.figure(figsize=(15, 5))
plot = sns.histplot(data=sia_vs_gaga_df, x='speechiness', hue='track_artist', kde=True)
plot.set_xlabel('Speechiness')
plot.grid(True)

plot.legend_.set_title('Artist')
No description has been provided for this image
In [41]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=sia_vs_gaga_df, x='danceability', hue='track_artist', kde=True)
plot.set_xlabel('Danceability')
plot.grid(True)

plot.legend_.set_title('Artist')
No description has been provided for this image
In [42]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=sia_vs_gaga_df, x='energy', hue='track_artist', kde=True)
plot.set_xlabel('Energy')
plot.grid(True)

plot.legend_.set_title('Artist')
No description has been provided for this image
In [43]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=sia_vs_gaga_df, x='key', hue='track_artist', kde=True)
plot.set_xlabel('Key')
plot.grid(True)

plot.legend_.set_title('Artist')
No description has been provided for this image
In [44]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=sia_vs_gaga_df, x='tempo', hue='track_artist', kde=True)
plot.set_xlabel('Tempo')
plot.grid(True)

plot.legend_.set_title('Artist')
No description has been provided for this image
Note: And there's my justification. Sia's speechiness and tempo span a greater range than Lady Gaga's, and that, to me, is an indicator of her diversity, while Lady Gaga sounds the same across her music. Maybe I suffer from confirmation bias!
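Note: A claim about "greater range" can be backed with numbers instead of reading it off the histograms. The sketch below computes the minimum, maximum, and standard deviation per artist on made-up values; the artist names and numbers are placeholders, not results from the Spotify data.

```python
import pandas as pd

# Made-up values for two placeholder artists.
toy = pd.DataFrame({
    'track_artist': ['Artist A'] * 4 + ['Artist B'] * 4,
    'tempo': [90.0, 117.0, 125.0, 168.0, 118.0, 119.0, 120.0, 121.0],
    'speechiness': [0.05, 0.06, 0.10, 0.21, 0.04, 0.05, 0.05, 0.06],
})

grouped = toy.groupby('track_artist')[['tempo', 'speechiness']]

# Spread summary per artist: min, max, and standard deviation of each variable.
spread = grouped.agg(['min', 'max', 'std'])

# Range (max minus min) per artist and variable.
value_range = grouped.max() - grouped.min()

print(spread)
print(value_range)
```

On the real data the same groupby could be applied to sia_vs_gaga_df, ideally after removing repeated tracks so that songs on many playlists do not dominate the standard deviation.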
In [45]:
sns.catplot(data=sia_vs_gaga_df, x='track_artist', y='tempo', kind='box');
No description has been provided for this image
In [46]:
sns.catplot(data=sia_vs_gaga_df, x='track_artist', y='tempo', kind='violin');
No description has been provided for this image
In [47]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='tempo', kde=True)
plot.set_xlabel('Tempo (BPM)')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
In [48]:
plt.figure(figsize=(15, 5))
plot = sns.histplot(data=df, x='duration_ms', kde=True)
plot.set_xlabel('Duration (ms)')
plot.set_ylabel('Number of Tracks')
plot.grid(True)
No description has been provided for this image
B.f.i. Categorical-to-categorical relationships (COMBINATIONS)¶
In [49]:
plt.figure(figsize=(15, 5))
sns.countplot(x='playlist_genre', hue='_danceability_categories', data=df)
plt.title('Playlist Genre and General Danceability Score')
plt.grid(True)
plt.show()
No description has been provided for this image
Note: I'm surprised edm doesn't have more "high" danceability songs, though it does have the highest count of "medium" danceability songs.
B.f.iii. Continuous-to-continuous relationships.¶
In [50]:
sns.pairplot(data=df[primary_continuous_variables[0:5]]);
No description has been provided for this image
In [51]:
sns.pairplot(data=df[primary_continuous_variables[5:]]);  # [5:] keeps duration_ms, which [5:-1] dropped
No description has been provided for this image
In [52]:
corr = df[['track_popularity','danceability','energy','speechiness','valence','liveness','acousticness','instrumentalness','loudness','tempo','duration_ms']].corr()
corr
Out[52]:
track_popularity danceability energy speechiness valence liveness acousticness instrumentalness loudness tempo duration_ms
track_popularity 1.000000 0.064748 -0.109112 0.006819 0.033231 -0.054584 0.085159 -0.149872 0.057687 -0.005378 -0.143682
danceability 0.064748 1.000000 -0.086073 0.181721 0.330523 -0.123859 -0.024519 -0.008655 0.025335 -0.184084 -0.096879
energy -0.109112 -0.086073 1.000000 -0.032150 0.151103 0.161223 -0.539745 0.033247 0.676625 0.149951 0.012611
speechiness 0.006819 0.181721 -0.032150 1.000000 0.064659 0.055426 0.026092 -0.103424 0.010339 0.044603 -0.089431
valence 0.033231 0.330523 0.151103 0.064659 1.000000 -0.020560 -0.016845 -0.175402 0.053384 -0.025732 -0.032225
liveness -0.054584 -0.123859 0.161223 0.055426 -0.020560 1.000000 -0.077243 -0.005507 0.077613 0.021018 0.006138
acousticness 0.085159 -0.024519 -0.539745 0.026092 -0.016845 -0.077243 1.000000 -0.006850 -0.361638 -0.112724 -0.081581
instrumentalness -0.149872 -0.008655 0.033247 -0.103424 -0.175402 -0.005507 -0.006850 1.000000 -0.147824 0.023335 0.063235
loudness 0.057687 0.025335 0.676625 0.010339 0.053384 0.077613 -0.361638 -0.147824 1.000000 0.093767 -0.115058
tempo -0.005378 -0.184084 0.149951 0.044603 -0.025732 0.021018 -0.112724 0.023335 0.093767 1.000000 -0.001412
duration_ms -0.143682 -0.096879 0.012611 -0.089431 -0.032225 0.006138 -0.081581 0.063235 -0.115058 -0.001412 1.000000
In [53]:
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, annot_kws={'size': 10}, fmt=".2f", cmap='coolwarm',
            cbar_kws={'shrink': .8}, linewidths=.5, cbar=False)
plt.title('Correlation Heatmap')
plt.show()
No description has been provided for this image
B.g. You must decide appropriate visualizations to show if RELATIONSHIPS change across GROUPS. This is especially important to continuous-to-continuous relationships. Do not forget though that categorical-to-continuous relationships can also be GROUPED BY a secondary categorical variable!¶
B.h.i. Visualize the relationship between the continuous response and the continuous inputs using scatter plots and trend plots.¶
In [54]:
fc = sns.FacetGrid(df, col='playlist_genre', col_wrap=3)
fc.map(sns.scatterplot, 'track_popularity', 'danceability')
fc.add_legend()
plt.show()
No description has been provided for this image
In [55]:
fc = sns.FacetGrid(df, col='playlist_genre', col_wrap=3)
fc.map(sns.scatterplot, 'valence', 'danceability')
fc.add_legend()
plt.show()
No description has been provided for this image
Note: The rock category does seem to show a slightly stronger relationship between danceability and valence, though nothing's popping out at me.
In [56]:
plt.figure(figsize=(15, 5))
plot = sns.lineplot(x='track_popularity', y='danceability', data=df)
plot.grid(True)
No description has been provided for this image
In [57]:
sns.catplot(data=df, x='track_popularity', y='danceability', kind='box', aspect=5);
No description has been provided for this image
In [58]:
plt.figure(figsize=(15, 5))
plot = sns.lineplot(x='valence', y='danceability', data=df)
plot.grid(True)
No description has been provided for this image
In [59]:
plt.figure(figsize=(15, 5))
plot = sns.lineplot(x='acousticness', y='energy', data=df)
plot.grid(True)
No description has been provided for this image
In [60]:
plt.figure(figsize=(15, 5))
plot = sns.lineplot(x='loudness', y='energy', data=df)
plot.grid(True)
No description has been provided for this image
In [61]:
sns.catplot(data=df[df['track_popularity'] < 25], x='track_popularity', y='danceability', kind='box', hue='playlist_genre', aspect=5);
No description has been provided for this image
In [62]:
sns.catplot(data=df[(df['track_popularity'] >= 25) & (df['track_popularity'] < 50)], x='track_popularity', y='danceability', kind='box', hue='playlist_genre', aspect=5);
No description has been provided for this image
In [63]:
sns.catplot(data=df[(df['track_popularity'] >= 50) & (df['track_popularity'] < 75)], x='track_popularity', y='danceability', kind='box', hue='playlist_genre', aspect=5);
No description has been provided for this image
In [64]:
sns.catplot(data=df[df['track_popularity'] >= 75], x='track_popularity', y='danceability', kind='box', hue='playlist_genre', aspect=5);
No description has been provided for this image
Note: Since there were too many data points for one plot, I split track_popularity into 4 bands of roughly equal width (cut at 25, 50, and 75) and used boolean indexing to provide the filters.
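Note: The cuts at 25, 50, and 75 can also be expressed with pd.cut, which assigns every popularity value, including the cut points themselves, to exactly one band. This is a sketch on a handful of made-up values; the band labels are my own naming.

```python
import pandas as pd

# Made-up popularity scores that include every cut point and both extremes.
popularity = pd.Series([0, 24, 25, 49, 50, 74, 75, 100], name='track_popularity')

# right=False gives left-closed bins: [0, 25), [25, 50), [50, 75), [75, 101).
# The top edge is 101 so that a score of 100 falls inside the last band.
bands = pd.cut(
    popularity,
    bins=[0, 25, 50, 75, 101],
    right=False,
    labels=['0-24', '25-49', '50-74', '75-100'],
)

print(bands.value_counts().sort_index())
```

The resulting column could then serve as a filter or facet variable, so each band gets its own box plot without hand-written boolean masks.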
B.h.ii. Group the relationships by categorical variables (if appropriate).¶
In [65]:
sns.pairplot(data=df[['track_popularity', 'danceability', 'energy', 'valence', 'playlist_genre']], hue='playlist_genre')
plt.show()
No description has been provided for this image
In [66]:
sns.pairplot(data=df[['tempo', 'danceability', 'playlist_genre']], hue='playlist_genre', height=3.5, aspect=1)
plt.show()
No description has been provided for this image
B.h.iii. Summarize the response for each unique value of the categorical inputs using boxplots, violin plots, and point plots.¶
In [67]:
sns.catplot(data=df, y='track_popularity', kind='box', hue='playlist_genre', aspect=3, palette='pastel');
No description has been provided for this image
In [68]:
sns.catplot(data=df, y='track_popularity', kind='violin', hue='playlist_genre', aspect=3, palette='pastel');
No description has been provided for this image
In [69]:
sns.catplot(data=df, y='danceability', kind='box', hue='playlist_genre', aspect=3, palette='pastel');
No description has been provided for this image
In [70]:
sns.catplot(data=df, y='danceability', kind='violin', hue='playlist_genre', aspect=3, palette='pastel');
No description has been provided for this image
In [71]:
sns.catplot(data=df, y='track_popularity', kind='box', hue='playlist_subgenre', aspect=3, palette='pastel');
No description has been provided for this image
In [72]:
sns.catplot(data=df, y='track_popularity', kind='violin', hue='playlist_subgenre', aspect=3, palette='pastel');
No description has been provided for this image
In [73]:
sns.catplot(data=df, y='valence', kind='box', hue='playlist_genre', aspect=3, palette='pastel');
No description has been provided for this image
In [74]:
sns.catplot(data=df, y='valence', kind='violin', hue='playlist_genre', aspect=3, palette='pastel');
No description has been provided for this image
Note: Looks like Latin playlists have the highest concentration of happy songs. Me encanta la musica.
In [75]:
sns.pointplot(data=df, x='playlist_genre', y='track_popularity', palette='dark')
Out[75]:
<Axes: xlabel='playlist_genre', ylabel='track_popularity'>
No description has been provided for this image
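Note: A point plot shows the mean with a 95% confidence interval by default. The intervals for some genres do not overlap, which suggests a statistically significant difference in mean track popularity between those genres (but not between all genres). Interesting.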
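Note: Reading significance off confidence intervals is informal. A formal check could look like the sketch below, which runs a one-way ANOVA and a Kruskal-Wallis test on synthetic data. Using scipy is my assumption (the notebook does not import it, although scikit-learn depends on it), and the genre names and popularity values are generated, not taken from the Spotify data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: genre_b is built with a lower mean popularity than the others.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'playlist_genre': np.repeat(['genre_a', 'genre_b', 'genre_c'], 200),
    'track_popularity': np.concatenate([
        rng.normal(50, 10, 200),
        rng.normal(40, 10, 200),
        rng.normal(50, 10, 200),
    ]),
})

# One array of popularity values per genre.
groups = [g['track_popularity'].to_numpy() for _, g in toy.groupby('playlist_genre')]

# One-way ANOVA: tests whether all group means are equal.
f_stat, p_value = stats.f_oneway(*groups)

# Kruskal-Wallis: a rank-based alternative that does not assume normality.
h_stat, p_kruskal = stats.kruskal(*groups)

print(f'ANOVA p-value: {p_value:.3g}')
print(f'Kruskal-Wallis p-value: {p_kruskal:.3g}')
```

Both tests only say that at least one genre differs; identifying which pairs differ would need a follow-up pairwise comparison.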
In [76]:
sns.pointplot(data=df, x='playlist_genre', y='tempo', palette='dark')
Out[76]:
<Axes: xlabel='playlist_genre', ylabel='tempo'>
In [77]:
sns.pointplot(data=df, x='playlist_genre', y='valence', palette='dark')
Out[77]:
<Axes: xlabel='playlist_genre', ylabel='valence'>
In [78]:
plt.figure(figsize=(12, 6))
ax = sns.pointplot(data=df, x='playlist_subgenre', y='valence', palette='dark')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()
C.a. Select variables. You MUST state which class of variables you used in your cluster analysis. For example, you could:¶
C.a.i. Cluster using CONTINUOUS inputs.¶
Answer: Continuous inputs. I ran two separate cluster analyses: tempo and valence for the first, track_popularity and danceability for the second.
C.b.i How many variables are you using relative to the number of observations?¶
Answer: Each cluster analysis uses 2 variables, so roughly 1 variable for every 16,416 observations.
C.b.ii Are the variables you are using “Gaussian-like” or are the distributions “odd looking”? You must visualize the MARGINAL histograms of the variables IF you are using summary statistics derived from INPUTS or OUTPUTS. If you are using the INPUTS or OUTPUTS directly, you already visualized the MARGINAL distributions.¶
Answer: Already visualized; refer to section B.e.i. Of the four variables (valence, tempo, track_popularity, and danceability), I would say valence looks the most Gaussian-like.
C.b.iii Are the variables you are using highly correlated to each other? You must visualize the relationships between the variables you are using. If you are using the INPUTS/OUTPUTS directly then you already visualized those relationships.¶
Answer: Already visualized; refer to section B.f.iii. Surprisingly, they are not highly correlated.
C.b.iv How should you handle missing values? If you drop all rows with at least one missing value, how many observations would you still have (thus, how many complete cases do you have)?¶
In [79]:
df[['track_popularity', 'danceability', 'tempo', 'valence']].isna().sum()
Out[79]:
track_popularity    0
danceability        0
tempo               0
valence             0
dtype: int64
Answer: No missing values in these four columns, so every row is a complete case. If there had been any, I would have dropped the affected rows entirely.
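Note: The question also asks how many complete cases would survive a drop. A small sketch of that count, using a hand-made `toy` frame with two missing values planted on purpose (the values are an assumption for illustration only):

```python
import numpy as np
import pandas as pd

cols = ['track_popularity', 'danceability', 'tempo', 'valence']

# Tiny made-up frame with one NaN in each of two different rows.
toy = pd.DataFrame({
    'track_popularity': [66, 67, np.nan, 60],
    'danceability': [0.75, 0.73, 0.68, np.nan],
    'tempo': [122.0, 100.0, 124.0, 122.0],
    'valence': [0.52, 0.69, 0.61, 0.28],
})

# Keep only rows with no missing value in any clustering column.
complete = toy.dropna(subset=cols)
n_complete = len(complete)
print(f'{n_complete} of {len(toy)} rows are complete cases')
```

Passing `subset=cols` keeps the check limited to the clustering variables, so a missing track name, for example, would not cost a row.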
C.d.i First, use 2 clusters and Count the number of observations per identified cluster. Are the clusters balanced?¶
In [80]:
k = 2

kmeans = KMeans(n_clusters=k, random_state=0)
df['tempo_valence_cluster'] = kmeans.fit_predict(df[['tempo', 'valence']])
df['track_popularity_danceability_cluster'] = kmeans.fit_predict(df[['track_popularity', 'danceability']])
In [81]:
visualize_categories('tempo_valence_cluster', 'Tempo Valence Cluster')
In [82]:
visualize_categories('track_popularity_danceability_cluster', 'Track Popularity and Danceability Cluster')
Note: Previously defined function is used again, let's go!! My first clustering is more balanced than my second, but neither is perfectly balanced.
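Note: The balance can also be read off as numbers rather than bars. This is a sketch on synthetic data (the `toy` frame, with a 70/30 split built in, is an assumption); in the notebook the same `value_counts` call would run on `df['tempo_valence_cluster']`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: 700 slower tracks and 300 faster tracks (assumed numbers).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'tempo': np.concatenate([rng.normal(95, 8, 700), rng.normal(150, 8, 300)]),
    'valence': rng.uniform(0, 1, 1000),
})

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy[['tempo', 'valence']])

# Count and share of observations per cluster label.
counts = pd.Series(labels, name='cluster').value_counts().sort_index()
shares = counts / counts.sum()
print(pd.DataFrame({'count': counts, 'share': shares.round(3)}))
```

Shares near 1/k for every cluster would indicate a balanced result; here the larger cluster holds roughly 70% by construction.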
C.d.ii. Visualize the cluster results between 2 of the variables used for the cluster analysis¶
In [83]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='tempo', y='valence', hue='tempo_valence_cluster', data=df, palette='viridis', style='tempo_valence_cluster', s=100)

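# Refit on tempo/valence to recover these centers: the shared `kmeans`
# object was last fit on the popularity/danceability columns.
kmeans.fit(df[['tempo', 'valence']])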
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')

plt.title('KMeans Clustering of Tempo and Valence')
plt.xlabel('Tempo')
plt.ylabel('Valence')
plt.legend(title='Cluster', loc='best')

plt.show()
In [84]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='track_popularity', y='danceability', hue='track_popularity_danceability_cluster', data=df, palette='viridis', style='track_popularity_danceability_cluster', s=100)


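# Refit on popularity/danceability to recover these centers: the shared
# `kmeans` object was last fit on the tempo/valence columns.
kmeans.fit(df[['track_popularity', 'danceability']])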
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')

plt.title('KMeans Clustering of Track Popularity and Danceability')
plt.xlabel('Track Popularity')
plt.ylabel('Danceability')
plt.legend(title='Cluster', loc='best')

plt.show()
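Note: KMeans uses Euclidean distance, so a variable measured on a large scale (tempo in beats per minute, track_popularity from 0 to 100) can dominate one measured between 0 and 1 (valence, danceability). The sketch below shows the effect on synthetic data and one common remedy, standardizing with `StandardScaler` before clustering. The `toy` frame and the hidden `mood` grouping are assumptions made up for the demonstration; this is not what the cells above do.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
mood = np.repeat([0, 1], n // 2)  # hidden low/high valence group (assumed)

toy = pd.DataFrame({
    'tempo': rng.uniform(60, 180, n),  # wide scale, no group structure
    'valence': np.where(mood == 0, 0.2, 0.8) + rng.normal(0, 0.02, n),
})

def two_clusters(X):
    """Fit a 2-cluster KMeans and return the labels."""
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

raw_labels = two_clusters(toy[['tempo', 'valence']])
scaled_labels = two_clusters(StandardScaler().fit_transform(toy[['tempo', 'valence']]))

# Agreement between each clustering and the hidden valence groups.
ari_raw = adjusted_rand_score(mood, raw_labels)
ari_scaled = adjusted_rand_score(mood, scaled_labels)
print(f'ARI without scaling: {ari_raw:.2f}, with scaling: {ari_scaled:.2f}')
```

Without scaling the split follows tempo alone and misses the valence groups; with scaling both variables contribute on equal footing. Whether that is desirable depends on the question being asked, so treat it as an option to try rather than a rule.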
C.e. Then, identify the optimal number of clusters, and rerun KMeans for the optimal number.¶
In [85]:
inertias = []
clusters_range = range(1, 11)

for k in clusters_range:
    kmeans = KMeans(n_clusters=k, random_state=0).fit(df[['tempo', 'valence']])
    inertias.append(kmeans.inertia_)
In [86]:
plt.figure(figsize=(15, 5))
plt.plot(clusters_range, inertias, marker='o')
plt.title('Elbow plot: tempo and valence')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.xticks(clusters_range)
plt.grid(True)
plt.show()
In [87]:
inertias = []
clusters_range = range(1, 11)

for k in clusters_range:
    kmeans = KMeans(n_clusters=k, random_state=0).fit(df[['track_popularity', 'danceability']])
    inertias.append(kmeans.inertia_)
In [88]:
plt.figure(figsize=(15, 5))
plt.plot(clusters_range, inertias, marker='o')
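plt.title('Elbow plot: track popularity and danceability')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia (within-cluster sum of squares)')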
plt.xticks(clusters_range)
plt.grid(True)
plt.show()
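Note: Reading an elbow by eye can be ambiguous. One complementary check is the silhouette score, which is highest when clusters are tight and well separated. The sketch below runs on synthetic blobs with three planted groups (an assumption for illustration, not the Spotify data).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups with known centers (assumed).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# The silhouette score needs at least 2 clusters, so start at k=2.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f'best k by silhouette: {best_k}')
```

On a frame as large as this one the pairwise distances get expensive; `silhouette_score` accepts a `sample_size` argument to score a random subsample instead.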
In [89]:
k = 5  # picked by eye from the two elbow plots above

kmeans = KMeans(n_clusters=k, random_state=0)
df['tempo_valence_cluster'] = kmeans.fit_predict(df[['tempo', 'valence']])
df['track_popularity_danceability_cluster'] = kmeans.fit_predict(df[['track_popularity', 'danceability']])
C.e.i. Count the number of observations per identified cluster.¶
In [90]:
visualize_categories('tempo_valence_cluster', 'Tempo Valence Cluster')
In [91]:
visualize_categories('track_popularity_danceability_cluster', 'Track Popularity Danceability Cluster')
C.e.ii. Visualize the cluster results between 2 of the variables used for the cluster analysis.¶
In [92]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='tempo', y='valence', hue='tempo_valence_cluster', data=df, palette='viridis', style='tempo_valence_cluster', s=100)

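# Refit on tempo/valence to recover these centers: the shared `kmeans`
# object was last fit on the popularity/danceability columns.
kmeans.fit(df[['tempo', 'valence']])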
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')

plt.title('KMeans Clustering of Tempo and Valence')
plt.xlabel('Tempo')
plt.ylabel('Valence')
plt.legend(title='Cluster', loc='best')

plt.show()
In [93]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='track_popularity', y='danceability', hue='track_popularity_danceability_cluster', data=df, palette='viridis', style='track_popularity_danceability_cluster', s=100)

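# Refit on popularity/danceability to recover these centers: the shared
# `kmeans` object was last fit on the tempo/valence columns.
kmeans.fit(df[['track_popularity', 'danceability']])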
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')

plt.title('KMeans Clustering of Track Popularity and Danceability')
plt.xlabel('Track Popularity')
plt.ylabel('Danceability')
plt.legend(title='Cluster', loc='best')

plt.show()
C.f.i.2. If you are working on a REGRESSION problem, are the identified clusters consistent with categories of a categorical input?¶
In [94]:
cross_tab = pd.crosstab(df['track_popularity_danceability_cluster'], df['playlist_genre'])

sns.heatmap(cross_tab, annot=True, cmap='coolwarm', fmt='g')
plt.title('Heatmap of Cluster vs Category')
plt.show()
In [95]:
sns.catplot(data=df[df['track_popularity'] > 75], x='track_popularity', y='danceability', kind='box', hue='playlist_genre', aspect=5);
In [96]:
sns.catplot(data=df, y='track_popularity', kind='violin', hue='playlist_genre', aspect=3, palette='pastel');
In [97]:
plot = sns.displot(data=df, x='track_popularity', hue='playlist_genre', kind='kde', height=5, aspect=3)

plot.axes[0, 0].set_xlabel('track_popularity')
plot.axes[0, 0].grid(True)
plot.legend.set_title('Playlist Genre')  # public accessor instead of the private _legend

plt.show()
In [98]:
sns.pointplot(data=df, x='playlist_genre', y='track_popularity', palette='dark')
Out[98]:
<Axes: xlabel='playlist_genre', ylabel='track_popularity'>
Answer: Yes, we can see that cluster 0, which captures the most popular tracks, is composed of a higher number of Pop and Latin genre tracks. This shows up in several of the visualizations above, especially the point plot.
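Note: Raw counts in a cluster-by-genre heatmap are hard to compare when the clusters differ in size. A sketch of two ways to put a number on "consistent": row-normalized shares and the adjusted Rand index. The `toy` labels below are random and independent by construction (an assumption for illustration), so they show what no agreement looks like.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Synthetic labels: cluster and genre are drawn independently (assumed).
rng = np.random.default_rng(0)
n = 3000
toy = pd.DataFrame({
    'cluster': rng.integers(0, 5, n),
    'playlist_genre': rng.choice(['edm', 'latin', 'pop', 'r&b', 'rap', 'rock'], n),
})

# Share of each genre within each cluster; every row sums to 1.
row_share = pd.crosstab(toy['cluster'], toy['playlist_genre'], normalize='index')
print(row_share.round(2))

# Adjusted Rand index: about 0 for unrelated labelings, 1 for identical ones.
genre_codes = toy['playlist_genre'].astype('category').cat.codes
ari = adjusted_rand_score(genre_codes, toy['cluster'])
print(f'ARI = {ari:.3f}')
```

In the notebook, `row_share` could be passed to `sns.heatmap` in place of the raw counts, and an ARI well above 0 would support the "Yes" above more firmly than the plots alone.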
C.f.ii.1. Use appropriate visualizations to study the CONDITIONAL distributions of the variables used for clustering GROUPED BY the identified clusters.¶
In [99]:
fg = sns.FacetGrid(df, col="tempo_valence_cluster", col_wrap=5, height=3)
fg.map(plt.hist, 'playlist_genre', bins=20, color='m')
plt.show()
In [100]:
fg = sns.FacetGrid(df, col="track_popularity_danceability_cluster", col_wrap=5, height=3)
fg.map(plt.hist, 'playlist_genre', bins=20, color='m')
plt.show()
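Note: The two grids above show playlist_genre within each cluster. The requirement asks about the clustering variables themselves, so a companion view grouped by cluster may help. Below is a numeric sketch on synthetic data (the `toy` frame is an assumption); in the notebook the grouping column would be `tempo_valence_cluster` or `track_popularity_danceability_cluster`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the tempo/valence columns (assumed distributions).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'tempo': rng.normal(120, 25, 500),
    'valence': rng.uniform(0, 1, 500),
})
toy['cluster'] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    toy[['tempo', 'valence']]
)

# Conditional summary of each clustering variable, one row per cluster.
summary = toy.groupby('cluster')[['tempo', 'valence']].agg(['mean', 'std', 'min', 'max'])
print(summary.round(2))
```

For the visual version, the same grouping works with seaborn, for example `sns.displot(data=df, x='tempo', col='tempo_valence_cluster')` or `sns.boxplot(data=df, x='tempo_valence_cluster', y='valence')`.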
A.a. Are you working on a REGRESSION or CLASSIFICATION problem?¶
Answer: Classification. I think I can classify which playlist subgenre a song will belong to based on features.
A.b. Which variables are inputs?¶
Answer: ['mode', 'playlist_genre', 'key', 'track_popularity', 'danceability', 'energy', 'speechiness', 'valence', 'liveness', 'acousticness', 'instrumentalness', 'loudness', 'tempo', 'duration_ms'] to start (playlist_subgenre is the response, so it is not an input). I will look at permutation feature importance to see which I can toss out.
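Note: A sketch of what the permutation check could look like with scikit-learn's `permutation_importance`. The data, the random forest, and the two feature columns are assumptions for illustration: `tempo` is informative by construction and `liveness` is pure noise, so the ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)  # two made-up classes standing in for subgenres

X = pd.DataFrame({
    'tempo': 100 + 40 * y + rng.normal(0, 10, n),  # informative by construction
    'liveness': rng.uniform(0, 1, n),              # pure noise
})

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the accuracy drop.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances.round(3))
```

Features whose importance sits near zero are candidates to toss out. Computing the importances on held-out rows, as here, avoids rewarding features the model has merely memorized.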
A.c. Which variables are responses/outputs/outcomes/targets?¶
Answer: playlist_subgenre
A.d. Did you need to DERIVE the responses of interest by SUMMARIZING the available data?¶
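Answer: No derivation was needed, since playlist_subgenre is already a column in the data. Summarizing the data certainly helped me choose it, though I probably could have chosen by eyeballing it.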
A.e. If so, what summary actions did you perform?¶
Answer: Descriptive stats, visualizations, counted missing values, correlation analysis, etc.
A.f. Which variables are identifiers and should NOT be used in the models?¶
Answer: The columns not in all_primaries, roughly: track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id
A.g.i. Which of the inputs do you think influence the response, based on your exploratory visualizations? Which exploratory visualization helped you identify potential input-to-output relationships? (If you are not sure which inputs seem to influence the response, it is okay to say so.)¶
Answer: Tempo is going to be a big factor in determining which playlist subgenre a song belongs to. I didn't decide on a classification model for subgenre until after I did the EDA, which is why this section is at the bottom. If I redid my EDA I would primarily focus on subgenre, but arriving at that decision is partly what EDA is for.
In [101]:
plt.figure(figsize=(12, 6))
ax = sns.pointplot(data=df, x='playlist_subgenre', y='danceability', palette='dark')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()
In [102]:
plt.figure(figsize=(12, 6))
ax = sns.pointplot(data=df, x='playlist_subgenre', y='energy', palette='dark')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()
In [103]:
plt.figure(figsize=(12, 6))
ax = sns.pointplot(data=df, x='playlist_subgenre', y='instrumentalness', palette='dark')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()
Time Required to Complete this Notebook - approx. 15 hours
Time Required to Run this Notebook - approx. 2 minutes
End